66 research outputs found

    Analysis and Modeling of Advanced PIM Architecture Design Tradeoffs

    Get PDF
    A major trend in high performance computer architecture over the last two decades is the migration of memory in the form of high speed caches onto the microprocessor semiconductor die. Where temporal locality in the computation is high, caches prove very effective at hiding memory access latency and contention for communication resources. However where temporal locality is absent, caches may exhibit low hit rates resulting in poor operational efficiency. Vector computing exploiting pipelined arithmetic units and memory access address this challenge for certain forms of data access patterns, for example involving long contiguous data sets exhibiting high spatial locality. But for many advanced applications for science, technology, and national security at least some data access patterns are not consistent to the restricted forms well handled by either caches or vector processing. An important alternative is the reverse strategy; that of migrating logic in to the main memory (DRAM) and performing those operations directly on the data stored there. Processor in Memory (PIM) architecture has advanced to the point where it may fill this role and provide an important new mechanism for improving performance and efficiency of future supercomputers for a broad range of applications. One important project considering both the role of PIM in supercomputer architecture and the design of such PIM components is the Cray Cascade Project sponsored by the DARPA High Productivity Computing Program. Cascade is a Petaflops scale computer targeted for deployment at the end of the decade that merges the raw speed of an advanced custom vector architecture with the high memory bandwidth processing delivered by an innovative class of PIM architecture. The work represented here was performed under the Cascade project to explore critical design space issues that will determine the value of PIM in supercomputers and contribute to the optimization of its design. But this work also has strong relevance to hybrid systems comprising a combination of conventional microprocessors and advanced PIM based intelligent main memory

    Initial Kernel Timing Using a Simple PIM Performance Model

    Get PDF
    This presentation will describe some initial results of paper-and-pencil studies of 4 or 5 application kernels applied to a processor-in-memory (PIM) system roughly similar to the Cascade Lightweight Processor (LWP). The application kernels are: * Linked list traversal * Sun of leaf nodes on a tree * Bitonic sort * Vector sum * Gaussian elimination The intent of this work is to guide and validate work on the Cascade project in the areas of compilers, simulators, and languages. We will first discuss the generic PIM structure. Then, we will explain the concepts needed to program a parallel PIM system (locality, threads, parcels). Next, we will present a simple PIM performance model that will be used in the remainder of the presentation. For each kernel, we will then present a set of codes, including codes for a single PIM node, and codes for multiple PIM nodes that move data to threads and move threads to data. These codes are written at a fairly low level, between assembly and C, but much closer to C than to assembly. For each code, we will present some hand-drafted timing forecasts, based on the simple PIM performance model. Finally, we will conclude by discussing what we have learned from this work, including what programming styles seem to work best, from the point-of-view of both expressiveness and performance

    Short Conduction Delays Cause Inhibition Rather than Excitation to Favor Synchrony in Hybrid Neuronal Networks of the Entorhinal Cortex

    Get PDF
    How stable synchrony in neuronal networks is sustained in the presence of conduction delays is an open question. The Dynamic Clamp was used to measure phase resetting curves (PRCs) for entorhinal cortical cells, and then to construct networks of two such neurons. PRCs were in general Type I (all advances or all delays) or weakly type II with a small region at early phases with the opposite type of resetting. We used previously developed theoretical methods based on PRCs under the assumption of pulsatile coupling to predict the delays that synchronize these hybrid circuits. For excitatory coupling, synchrony was predicted and observed only with no delay and for delays greater than half a network period that cause each neuron to receive an input late in its firing cycle and almost immediately fire an action potential. Synchronization for these long delays was surprisingly tight and robust to the noise and heterogeneity inherent in a biological system. In contrast to excitatory coupling, inhibitory coupling led to antiphase for no delay, very short delays and delays close to a network period, but to near-synchrony for a wide range of relatively short delays. PRC-based methods show that conduction delays can stabilize synchrony in several ways, including neutralizing a discontinuity introduced by strong inhibition, favoring synchrony in the case of noisy bistability, and avoiding an initial destabilizing region of a weakly type II PRC. PRCs can identify optimal conduction delays favoring synchronization at a given frequency, and also predict robustness to noise and heterogeneity

    A Multidisciplinary Optimization Approach to Integrated Circuit Design

    No full text
    In this paper, we investigate potential applications of multidisciplinary design optimization (MDO) algorithms to integrated circuit design. These algorithms use global sensitivity equations as a tool that provides for a temporary decoupling of the constituent subsystems of a complex system so that designers with expertise in different disciplines can design the subsystems. We develop two example design problems, one with hierarchically coupled subsystems and one with non-hierarchically coupled subsystems, to which we apply multidisciplinary optimization: (1) An example of joint process-circuit optimization, in which a semiconductor fabrication process is tuned concurrently with the design of a CMOS circuit, and (2) an example of circuit -thermal design, in which cell placement and cell design are done simultaneously with regard to the temperature effects on circuit performances. We discuss how this approach enables efficient and concurrent optimization of integrated circuits. 1.0 Intr..

    Measurement and Analysis of Sequential Design Processes

    No full text
    this paper we describe the development of an analytical approach for evaluating sequential design process completion time and for determining the sensitivities of design time with respect to individual task durations and transition probabilities. Techniques are also detailed for collecting process metadata and calibrating a design process model. Example applications illustrate the use of the methodology in analyzing and improving software and hardware design processes. Categories and Subject Descriptors: J.6 [Computer Applications]: Computer-Aided Engineering - computer-aided design; K.3.2 [Computers and Education]: Computer and Information Science Education - self-assessmen
    • …
    corecore